Abstract


Hierarchies of concepts are useful in many applications,
from navigation to organization of objects. Usually, a
hierarchy is created in a centralized manner by employing a
group of domain experts, a time-consuming and expensive
process. The experts often design one single hierarchy to
best explain the semantic relationships among the concepts,
and ignore the natural uncertainty that may exist in the
process. In this paper, we propose a crowdsourcing system to
build a hierarchy and furthermore capture the underlying
uncertainty. Our system maintains a distribution over
possible hierarchies and actively selects questions to ask
using an information gain criterion. We evaluate our
methodology on simulated data and on a set of real-world
application domains. Experimental results show that our
system is robust to noise, efficient in picking questions,
cost-effective, and builds high-quality hierarchies.


1 Introduction 

Hierarchies of concepts and objects are useful across many
real-world applications and scientific domains. Online
shopping portals such as [Amazon, 2015] use product catalogs
to organize their products into a hierarchy, aiming to
simplify the task of search and navigation for their
customers. Sharing the goal of organizing objects and
information, hierarchies are prevalent in many other domains,
such as in libraries to organize books [Dewey, 1876] or web
portals to organize documents by topics. Concept hierarchies
also serve as a natural semantic prior over concepts, helpful
in a wide range of Artificial Intelligence (AI) domains, such
as natural language processing [Bloehdorn et al., 2005] and
computer vision [Deng et al., 2009; Lai et al., 2011b].


Task-dependent hierarchies, as in product catalogs, are
expensive and time-consuming to construct. They are usually
built in a centralized manner by a group of domain experts.
This process makes it infeasible to create separate
hierarchies for each specific domain. On the other hand, in
the absence of such specific hierarchies, many applications
use a general-purpose pre-built hierarchy (for example,
WordNet [Fellbaum, 1998]) that may be too abstract or
inappropriate for specific needs. An important question in
this context is thus: How can we cost-efficiently build
task-dependent hierarchies without requiring domain experts?

Attempts to build hierarchies using fully automatic methods
[Blei et al., 2003] have failed to capture the relationships
between concepts as perceived by people. The resulting
hierarchies perform poorly when deployed in real-world
systems. With the recent popularity of crowdsourcing
platforms, such as Amazon Mechanical Turk (AMT), efforts have
been made in employing non-expert workers (the crowd) at
scale and low cost, to build hierarchies guided by human
knowledge. Chilton et al. [2013] propose the Cascade workflow
that converts the process of building a hierarchy into the
task of multi-label annotation for objects. However,
acquiring multi-label annotations for objects is expensive
and might be uninformative for creating hierarchies. This
leads to the question: How can we actively select simple and
useful questions that are most informative to the system
while minimizing the cost?

Most existing methods (including Cascade) as well as methods
employing domain experts usually generate only a single
hierarchy aiming to best explain the data or the
relationships among the concepts. This ignores the natural
ambiguity and uncertainty that may exist in the semantic
relationships among the concepts, leading to the question:
How can we develop probabilistic methods that can account for
this uncertainty in the process of building the hierarchy?

Our Contributions. In this paper, we propose a novel 
crowdsourcing system for inferring hierarchies of concepts, 
tackling the questions posed above. We develop a principled 
algorithm powered by the crowd, which is robust to noise, 
efficient in picking questions, cost-effective, and builds high 
quality hierarchies. We evaluate our proposed approach on 
synthetic problems, as well as on real-world domains with 
data collected from AMT workers, demonstrating the broad 
applicability of our system. 

The remainder of this paper is structured as follows: After
discussing related work in Section 2, we present our method
in Section 3. We continue with experiments in Section 4 and
conclude in Section 5.


2 Related Work 

Concept hierarchies have been helpful in solving natural
language processing tasks, for example, disambiguating word
sense in text retrieval [Voorhees, 1993], information
extraction [Bloehdorn et al., 2005], and machine translation
[Knight, 1993]. Meanwhile, hierarchies between object classes
have been deployed in the computer vision community to
improve object categorization with thousands of classes and
limited training images [Rohrbach et al., 2011], scalable
image classification [Deng et al., 2009; Deng et al., 2013;
Lai et al., 2011b], and image annotation efficiency [Deng et
al., 2014]. In these methods, it is usually assumed that the
hierarchies have already been built, and the quality of the
hierarchies can influence the performance of these methods
significantly.

The traditional way of hierarchy creation is to hire a small
group of experts to build the hierarchy in a centralized
manner [Fellbaum, 1998], which is expensive and time
consuming. Therefore, people have developed automatic or
semi-automatic methods to build hierarchies. For instance,
vision-based methods, such as Sivic et al. [2008] and Bart et
al. [2008], build an object hierarchy using visual feature
similarities. However, visually similar concepts are not
necessarily similar in semantics.

Another type of method for hierarchy creation is related to
ontology learning from text and the web [Buitelaar et al.,
2005; Wong et al., 2012; Carlson et al., 2010]. The goal of
ontology learning is to extract terms and the relationships
between these concepts. However, the focus of these
techniques is on coverage rather than accuracy, and the
hierarchies that can be extracted from these approaches are
typically not very accurate. Since taxonomy is the most
important relationship in an ontology, much work has focused
on building taxonomy hierarchies. For example, co-occurrence
based methods [Budanitsky, 1999] use word co-occurrence to
define the similarity between words, and build hierarchies
using clustering. These methods usually do not perform well
because they lack common sense [Wong et al., 2012]. On the
other hand, template-based methods [Hippisley et al., 2005]
deploy domain knowledge and can achieve higher accuracy. Yet,
it is hard to adapt template-based methods to new domains.
Since humans are good at common sense and domain adaptation,
involving humans in hierarchy learning remains highly
necessary and desirable [Wong et al., 2012].


The popularity of crowdsourcing platforms has made cheap
human resources available for building hierarchies. For
example, Cascade [Chilton et al., 2013] uses multi-label
annotations for items, and deploys label co-occurrence to
generate a hierarchy. Deluge [Bragg et al., 2013] improves
the multi-label annotation step in Cascade using decision
theory and machine learning to reduce the labeling effort.
However, for both pipelines, co-occurrence of labels does not
necessarily imply a connection in the hierarchy. Furthermore,
both methods can build only a single hierarchy, not
considering the uncertainty naturally existing in
hierarchies.


Orthogonal to building hierarchies, Mortensen et al. [2006]
use crowdsourcing to verify an existing ontology. Their
empirical results demonstrate that non-expert workers are
able to verify structures within a hierarchy built by domain
experts. Inspired by their insights, it is possible to gather
information about the hierarchy structure by asking simple
true-or-false questions about the “ascendant-descendant”
relationship between two concepts. In this work, we propose a
novel method of hierarchy creation based on asking such
questions and fusing the information together.


3 Approach 

The goal of our approach is to learn a hierarchy over a
domain of concepts, using input from non-expert crowdsourcing
workers. Estimating hierarchies through crowdsourcing is
challenging, since answers given by workers are inherently
noisy, and, even if every worker gives her/his best possible
answer, concept relationships might be ambiguous and there
might not exist a single hierarchy that consistently explains
all the workers’ answers. We deal with these problems by
using a Bayesian framework to estimate probability
distributions over hierarchies, rather than determining a
single, best guess. This allows our approach to represent
uncertainty due to noisy, missing, and possibly inconsistent
information. Our system interacts with crowdsourcing workers
iteratively while estimating the distribution over
hierarchies. At each iteration, the system picks a question
related to the relationship between two concepts in the
hierarchy, presents it to multiple workers on a crowdsourcing
platform, and then uses the answers to update the
distribution. The system keeps asking questions until a
stopping criterion is reached. In this work, we set a
threshold for the number of asked questions.


3.1 Modelling Distributions over Hierarchies 

The key challenge for estimating distributions over
hierarchies is the huge number of possible hierarchies, or
trees, making it intractable to directly represent the
distribution as a multinomial. The number of possible
hierarchies of $N$ concepts is $(N+1)^{N-1}$ (we add a fixed
root node to the concept set), which results in 1,296 trees
for 5 concepts, but already about $2.36 \times 10^9$ trees
for only 10 concepts. We will now describe how to represent
and estimate distributions over such a large number of trees.
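
As a quick check of the counts quoted above, the following sketch evaluates Cayley's formula for labeled trees on $N+1$ nodes (the $N$ concepts plus the fixed root). The function name `num_hierarchies` is ours and purely illustrative.

```python
def num_hierarchies(n_concepts: int) -> int:
    """Number of trees over n_concepts nodes plus one fixed root node.

    Cayley's formula gives (n + 1) ** (n - 1) labeled trees on n + 1 nodes;
    rooting each tree at the fixed root orients all of its edges.
    """
    return (n_concepts + 1) ** (n_concepts - 1)


for n in (5, 10, 20):
    print(n, num_hierarchies(n))
```

Already for 20 concepts the count exceeds $10^{25}$, which is why an explicit table of tree probabilities is not an option.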

Assume that there are $N$ concept nodes indexed from 1 to
$N$, a fixed root node indexed by 0, and a set of possible
directed edges
$\mathcal{E} = \{e_{0,1}, \ldots, e_{i,j}, \ldots, e_{N,N-1}\}$
indexed by $(i, j)$, where $i \neq j$. A hierarchy
$T \subset \mathcal{E}$ is a set of $N$ edges, which form a
valid tree rooted at the 0-th node (we use the terms
hierarchy and tree interchangeably). All valid hierarchies
form the sample space $\mathcal{T} = \{T_1, \ldots, T_M\}$.
The prior distribution $\pi^{(0)}$ over $\mathcal{T}$ is set
to be the uniform distribution.

Due to the intractable number of trees, we use a compact
model to represent distributions over trees:

$$P(T \mid W) = \frac{\prod_{e_{i,j} \in T} W_{i,j}}{Z(W)}, \tag{1}$$

where $W_{i,j}$ is a non-negative weight for the edge
$e_{i,j}$, and
$Z(W) = \sum_{T' \in \mathcal{T}} \prod_{e_{i,j} \in T'} W_{i,j}$
is the partition function.

Given $W$, inference is very efficient. For example, $Z(W)$
can be computed analytically using the Matrix-Tree Theorem
[Tutte, 1984]. This way, we can also analytically compute
marginal probabilities over the edges, i.e., $P(e_{i,j})$.
The tree with the highest probability can be found via the
famous algorithm of Chu and Liu [1965]. A uniform prior is
incorporated by initially setting all $W_{i,j}$ to the same
positive value.
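
To make the closed-form inference concrete, here is a minimal pure-Python sketch of the directed Matrix-Tree computation. It assumes `W[i][j]` holds the weight of the edge from parent `i` to child `j` and that node 0 is the root; the helper names (`det`, `partition`, `edge_marginal`) are ours, and computing a marginal by zeroing one weight is just one simple way to get it, not necessarily the one used by the authors.

```python
def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, result = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-300:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def partition(W):
    """Z(W): sum over all trees rooted at node 0 of the product of edge weights."""
    n = len(W)
    # In-degree Laplacian with the root's row and column removed.
    lap = [[0.0] * (n - 1) for _ in range(n - 1)]
    for j in range(1, n):
        for i in range(n):
            if i == j:
                continue
            lap[j - 1][j - 1] += W[i][j]
            if i >= 1:
                lap[i - 1][j - 1] -= W[i][j]
    return det(lap)


def edge_marginal(W, i, j):
    """P(e_ij in T): one minus the mass of the trees that avoid edge (i, j)."""
    without = [row[:] for row in W]
    without[i][j] = 0.0
    return 1.0 - partition(without) / partition(W)
```

With all weights equal to 1 and five concepts, `partition` returns 1296, matching the tree count given in Section 3.1.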

Our system maintains a posterior distribution over
hierarchies. Given a sequence of questions regarding the
structure of the target hierarchy, $q^{(1)}, \ldots, q^{(t)}$,
along with the answers $a^{(1)}, \ldots, a^{(t)}$, the
posterior $P(T \mid W^{(t)})$ at time $t$ is obtained by
Bayesian inference:

$$P(T \mid W^{(t)}) \propto P(T \mid W^{(t-1)}) \, f(a^{(t)} \mid T). \tag{2}$$

Hereby, $f(a^{(t)} \mid T)$ is the likelihood of obtaining
answer $a^{(t)}$ given a tree $T$. It is specified below,
where we drop the index $t$ to simplify the notation.

So far, we have not made any assumptions about the form of
questions asked. Since our system works with non-expert
workers, the questions should be as simple as possible. As
discussed above, we resort to questions that only specify the
relationship between pairs of concepts. We will discuss
different options in the following sections.


3.2 Edge Questions

Since a hierarchy is specified by a set of edges, one way to
ask questions could be to ask workers about immediate
parent-child relationships between concepts, which we call
edge questions. Answers to edge questions are highly
informative, since they provide direct information about
whether there is an edge between two concepts in the target
hierarchy.

Let $e_{i,j}$ denote the question of whether there is an edge
between node $i$ and $j$, and $a_{i,j} \in \{0, 1\}$ denote
the answer for $e_{i,j}$. $a_{i,j} = 1$ indicates that a
worker believes there is an edge from node $i$ to $j$;
otherwise there is no edge. The likelihood function for edge
questions is defined as follows:

$$f(a_{i,j} \mid T) = \begin{cases} (1-\gamma)^{a_{i,j}} \, \gamma^{1-a_{i,j}} & \text{if } e_{i,j} \in T \\ \gamma^{a_{i,j}} \, (1-\gamma)^{1-a_{i,j}} & \text{otherwise} \end{cases} \tag{3}$$

where $\gamma$ is the noise rate for wrong answers.
Substituting (3) into (2) leads to an analytic form to update
edge weights:

$$W^{(t)}_{i',j'} = \begin{cases} W^{(t-1)}_{i',j'} \left(\frac{1-\gamma}{\gamma}\right)^{2a_{i,j}-1} & \text{if } i' = i \wedge j' = j \\ W^{(t-1)}_{i',j'} & \text{otherwise} \end{cases} \tag{4}$$

An edge question will only affect the weight for that edge.
Unfortunately, correctly answering such questions is
difficult and requires global knowledge of the complete set
(and granularity) of concepts. For instance, while the
statement “Is orange a direct child of fruit in a food item
hierarchy?” might be true for some concept sets, it is not
correct in a hierarchy that also contains the more specific
concept citrus fruit, since it separates orange from fruit
(see also Fig. 3).

3.3 Path Questions

To avoid the shortcomings of edge questions, our system
resorts to asking less informative questions relating to
general, ascendant-descendant relationships between concepts.
These path questions only provide information about the
existence of directed paths between two concepts and are
thus, crucially, independent of the set of available
concepts. For instance, the path question “Is orange a type
of fruit?” is true independent of the existence of the
concept citrus fruit. While such path questions are easier to
answer, they are more challenging to use when estimating the
distribution over hierarchies.

To see this, let $p_{i,j}$ denote a path question and
$a_{i,j} \in \{0, 1\}$ be the answer for $p_{i,j}$.
$a_{i,j} = 1$ indicates that a worker believes there is a
path from node $i$ to $j$. The likelihood function is

$$f(a_{i,j} \mid T) = \begin{cases} (1-\gamma)^{a_{i,j}} \, \gamma^{1-a_{i,j}} & \text{if } p_{i,j} \in T \\ \gamma^{a_{i,j}} \, (1-\gamma)^{1-a_{i,j}} & \text{otherwise} \end{cases} \tag{5}$$

where $p_{i,j} \in T$ simply checks whether the path
$p_{i,j}$ is contained in the tree $T$.
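
The difference between edge and path questions, and the shared noise model of Eqs. (3) and (5), can be sketched as follows. We assume a tree is stored as a parent array (`parent[0]` is `None` for the root) and that $0 < \gamma < 1$; all function names are illustrative. The weight multiplier in `edge_weight_factor` is the likelihood ratio obtained by substituting the edge likelihood into the Bayesian update, since trees that do not contain the edge keep their weight up to normalization.

```python
def has_edge(parent, i, j):
    """True if i is the direct parent of j."""
    return parent[j] == i


def has_path(parent, i, j):
    """True if i is a proper ancestor of j."""
    node = parent[j]
    while node is not None:
        if node == i:
            return True
        node = parent[node]
    return False


def answer_likelihood(answer, holds, gamma):
    """f(a | T): probability of a 0/1 answer given whether the queried
    relation actually holds in the tree."""
    correct = (answer == 1) == holds
    return 1.0 - gamma if correct else gamma


def edge_weight_factor(answer, gamma):
    """Multiplier applied to W_ij after an edge question about (i, j)."""
    return ((1.0 - gamma) / gamma) ** (2 * answer - 1)


# Toy hierarchy: 0 = root, 1 = fruit, 2 = citrus fruit, 3 = orange.
tree = [None, 0, 1, 2]
print(has_edge(tree, 1, 3))   # orange is not a direct child of fruit
print(has_path(tree, 1, 3))   # but fruit is an ancestor of orange
```

In this toy tree, a worker who answers "yes" to the path question about fruit and orange is consistent with the tree, while the same "yes" to the edge question is not, which is the granularity problem described in Section 3.2.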


Unfortunately, the likelihood function for path questions is
not conjugate to the prior. Therefore, there is no analytic
form to update weights. Instead, we update the weight matrix
by performing approximate inference. To be more specific, we
find a $W^*$ by minimizing the KL-divergence [Kullback and
Leibler, 1951] between $P(T \mid W^*)$ and the true
posterior:

$$W^* = \arg\min_{W} \mathrm{KL}\big(P(T \mid W^{(t)}) \,\|\, P(T \mid W)\big). \tag{6}$$

It can be shown that minimizing the KL-divergence can be
achieved by minimizing the following loss function:

$$L(W) = -\sum_{T \in \mathcal{T}} P(T \mid W^{(t)}) \log P(T \mid W). \tag{7}$$

Directly computing (7) involves enumerating all trees in
$\mathcal{T}$, and is therefore intractable. Instead, we use
a Monte Carlo method to estimate (7), and minimize the
estimated loss to update $W$. To be more specific, we
generate i.i.d. samples
$\tilde{\mathcal{T}} = (T_1, \ldots, T_m)$ from
$P(T \mid W^{(t)})$, which define an empirical distribution
$\tilde{\pi}$ of the samples, with estimated loss

$$L_{\tilde{\pi}}(W) = -\sum_{T \in \tilde{\mathcal{T}}} \tilde{\pi}(T) \log P(T \mid W), \tag{8}$$

the negative log-likelihood of $P(T \mid W)$ under the
samples.
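
The step from (6) to (7) is the standard expansion of the KL-divergence. Writing $P(T \mid W^{(t)})$ for the true posterior, as in (6), the first term below does not depend on $W$, so minimizing the divergence over $W$ is the same as minimizing the second term, which is the loss in (7):

```latex
\begin{align*}
\mathrm{KL}\big(P(T \mid W^{(t)}) \,\|\, P(T \mid W)\big)
  &= \sum_{T \in \mathcal{T}} P(T \mid W^{(t)})
     \log \frac{P(T \mid W^{(t)})}{P(T \mid W)} \\
  &= \underbrace{\sum_{T \in \mathcal{T}} P(T \mid W^{(t)})
     \log P(T \mid W^{(t)})}_{\text{independent of } W}
     \;-\; \sum_{T \in \mathcal{T}} P(T \mid W^{(t)}) \log P(T \mid W).
\end{align*}
```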


Sampling Hierarchies from the Posterior 

If a weight matrix $W$ is given, sampling hierarchies from
the distribution defined in (1) can be achieved efficiently,
for example, using a loop-avoiding random walk on a graph
with $W$ as the adjacency matrix [Wilson, 1996]. Therefore,
we can sample hierarchies from the prior
$P(T \mid W^{(t-1)})$. Noticing that the posterior defined in
(2) is a weighted version of $P(T \mid W^{(t-1)})$, we can
generate samples for the empirical distribution $\tilde{\pi}$
via importance sampling, that is, by reweighting the samples
from $P(T \mid W^{(t-1)})$ with the likelihood function
$f(a^{(t)} \mid T)$ as importance weights.
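
The two steps of this paragraph can be sketched as follows. The sampler is a textbook loop-erased random walk in the style of Wilson's algorithm, not the authors' code; it assumes all weights are positive, that `W[i][j]` is the weight of the edge from parent `i` to child `j`, and that trees are returned as parent arrays. The reweighting function assumes a path question and $0 < \gamma < 1$. All names are illustrative.

```python
import random


def sample_tree(W, rng):
    """Draw a parent array with probability proportional to the product of
    W[parent[j]][j] over all non-root nodes j."""
    n = len(W)
    parent = [None] * n
    in_tree = [False] * n
    in_tree[0] = True                       # node 0 is the fixed root
    for start in range(1, n):
        u = start
        while not in_tree[u]:               # walk until the current tree is hit
            cands = [i for i in range(n) if i != u]
            parent[u] = rng.choices(cands, weights=[W[i][u] for i in cands])[0]
            u = parent[u]                   # revisits overwrite parent[], erasing loops
        u = start
        while not in_tree[u]:               # keep the loop-erased path
            in_tree[u] = True
            u = parent[u]
    return parent


def has_path(parent, i, j):
    """True if i is a proper ancestor of j."""
    node = parent[j]
    while node is not None:
        if node == i:
            return True
        node = parent[node]
    return False


def importance_weights(trees, i, j, answer, gamma):
    """Normalized weights proportional to f(a | T) for the path question (i, j)."""
    raw = []
    for tree in trees:
        correct = (answer == 1) == has_path(tree, i, j)
        raw.append(1.0 - gamma if correct else gamma)
    total = sum(raw)
    return [w / total for w in raw]


rng = random.Random(0)
W = [[1.0] * 4 for _ in range(4)]           # uniform prior over 3 concepts
trees = [sample_tree(W, rng) for _ in range(1000)]
weights = importance_weights(trees, 1, 2, answer=1, gamma=0.1)
```

The pairs `(trees, weights)` then play the role of the empirical distribution used in the estimated loss (8).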


Regularization 

Since we only get samples from $P(T \mid W^{(t)})$, the
estimated loss (8) can be inaccurate. To avoid overfitting to
the sample, we add an $\ell_1$-regularization term to the
objective function. We also optimize $\lambda = \log W$
rather than $W$ so as to simplify notation. The final
objective is as follows:

$$L_{\tilde{\pi}}(\lambda^{(t)}) = -\sum_{T \in \tilde{\mathcal{T}}} \tilde{\pi}^{(t)}(T) \log P(T \mid \lambda^{(t)}) + \beta \sum_{i,j} |\lambda^{(t)}_{i,j}|. \tag{9}$$








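Algorithm 1 Weight Updating Algorithm

Input: $W^{(t-1)}$, an answer $a^{(t)}$, thr for stopping criterion
       Non-negative regularization parameter $\beta$
Output: $W^{(t)}$ that minimizes (9)

Generate samples $T'_1, \ldots, T'_m$ from $P(T \mid W^{(t-1)})$
Use importance sampling to get empirical distribution $\tilde{\pi}$
Initialize $\lambda^{(0)} = 0$, $\ell = 1$.
repeat
    For each $(i, j)$, set $\delta_{i,j} = \arg\min$ (10);
    Update $\lambda^{(\ell)} = \lambda^{(\ell-1)} + \delta$;
    $\ell = \ell + 1$;
until $|\delta| <$ thr
return $W^{(t)} = \exp(\lambda^{(\ell-1)})$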
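
The per-edge step inside the loop of Algorithm 1 can be sketched as below. The sketch assumes the per-edge contribution to bound (10) is $-\delta \tilde{P}(e) + \frac{1}{N} P(e \mid \lambda)(e^{N\delta} - 1) + \beta(|\lambda + \delta| - |\lambda|)$, which is convex in $\delta$, so its minimum lies either at the kink $\delta = -\lambda$ or at a stationary point on one side of it. The inputs `p_emp` (empirical edge marginal), `p_model` (model edge marginal, assumed positive) and all function names are illustrative, and $\beta > 0$ is assumed.

```python
import math


def bound_term(delta, lam, p_emp, p_model, beta, n_edges):
    """Contribution of a single edge to the upper bound on the loss change."""
    return (-delta * p_emp
            + p_model * (math.exp(n_edges * delta) - 1.0) / n_edges
            + beta * (abs(lam + delta) - abs(lam)))


def best_delta(lam, p_emp, p_model, beta, n_edges):
    """Pick the step for one edge among the three candidate minimizers."""
    candidates = [-lam]                       # lands exactly on lam + delta = 0
    if p_emp - beta > 0:
        d = math.log((p_emp - beta) / p_model) / n_edges
        if lam + d > 0:                       # stationary point, positive side
            candidates.append(d)
    d = math.log((p_emp + beta) / p_model) / n_edges
    if lam + d < 0:                           # stationary point, negative side
        candidates.append(d)
    return min(candidates,
               key=lambda x: bound_term(x, lam, p_emp, p_model, beta, n_edges))


# An edge seen in 80% of the weighted samples but given only 30% by the model
# gets a positive step, i.e. its log-weight is increased.
print(best_delta(lam=0.0, p_emp=0.8, p_model=0.3, beta=0.01, n_edges=5))
```

Because the bound is a sum of such terms, one per edge, each `best_delta` call can be made independently of the others.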


Optimization Algorithm

We iteratively adjust $\lambda$ to minimize (9). At each
iteration, the algorithm adds $\delta$ to the current
$\lambda$, resulting in $\lambda' = \lambda + \delta$. We
optimize $\delta$ to minimize an upper bound on the change in
$L_{\tilde{\pi}}$, given by

$$L_{\tilde{\pi}}(\lambda') - L_{\tilde{\pi}}(\lambda) \le \sum_{i,j} \Big[ -\delta_{i,j} \tilde{P}(e_{i,j}) + \frac{1}{N} P(e_{i,j} \mid \lambda)\big(e^{N\delta_{i,j}} - 1\big) + \beta\big(|\lambda_{i,j} + \delta_{i,j}| - |\lambda_{i,j}|\big) \Big] + C, \tag{10}$$

where
$\tilde{P}(e_{i,j}) = \sum_{T \in \tilde{\mathcal{T}} : e_{i,j} \in T} \tilde{\pi}(T)$
is the empirical marginal probability of $e_{i,j}$, and $C$
is a constant w.r.t. $\delta_{i,j}$. The derivation is
presented in [Sun et al., 2015].

Minimizing the upper bound in (10) can be done by analysing
the sign of $\lambda_{i,j} + \delta_{i,j}$. By some calculus,
it can be seen that the $\delta_{i,j}$ minimizing (10) must
occur when $\delta_{i,j} = -\lambda_{i,j}$, or when
$\delta_{i,j}$ is either

$$\frac{1}{N} \log \frac{\tilde{P}(e_{i,j}) - \beta}{P(e_{i,j} \mid \lambda)} \quad \text{if } \lambda_{i,j} + \delta_{i,j} > 0, \text{ or}$$

$$\frac{1}{N} \log \frac{\tilde{P}(e_{i,j}) + \beta}{P(e_{i,j} \mid \lambda)} \quad \text{if } \lambda_{i,j} + \delta_{i,j} < 0.$$

This leaves three choices for each $\delta_{i,j}$; we try out
each and pick the one leading to the best bound. This can be
done independently per $\delta_{i,j}$ since the objective in
(10) is separable. The full algorithm to optimize $\lambda$
based on a query answer is given in Algorithm 1.


Theoretical Guarantee

Even though we only minimize a sequence of upper bounds, we
prove that (see [Sun et al., 2015]) Algorithm 1 in fact
converges to the true maximum likelihood solution:

Theorem 1. Assume $\beta$ is strictly positive. Then
Algorithm 1 produces a sequence
$\lambda^{(1)}, \lambda^{(2)}, \ldots$ such that

$$\lim_{\ell \to \infty} L_{\tilde{\pi}}(\lambda^{(\ell)}) = \min_{\lambda} L_{\tilde{\pi}}(\lambda).$$

Let $\hat{\lambda}$ be the solution of Algorithm 1. We next
show that if we generate enough samples $m$, the loss of the
estimate $\hat{\lambda}$ under the true posterior will not be
much higher than that obtained by any distribution of the
form (1).

Theorem 2. Suppose $m$ samples $\tilde{\pi}$ are obtained
from any tree distribution $\pi$. Let $\hat{\lambda}$
minimize the regularized log loss $L_{\tilde{\pi}}(\lambda)$
with $\beta = \sqrt{\log(N/\delta)/m}$. Then for every
$\lambda$ it holds with probability at least $1 - \delta$
that

$$L_{\pi}(\hat{\lambda}) \le L_{\pi}(\lambda) + 2 \|\lambda\|_1 \sqrt{\log(N/\delta)/m}.$$

Theorem 2 shows that the difference in performance between
the density estimate computed by minimizing w.r.t.
$L_{\tilde{\pi}}$ and w.r.t. the best approximation to the
true posterior becomes small rapidly as the number of samples
$m$ increases.

3.4 Active Query Selection

At each interaction with workers, the system needs to pick a
question to ask. The naive approach would be to pick
questions randomly. However, random questions are usually not
very informative since they mostly get No answers, while Yes
answers are more informative about the structure. Instead, we
propose to select the question $p^*$ that maximizes
information gain over the current distribution $\pi^{(t)}$.
Since the entropy of $\pi^{(t)}$ is the same for every
question, this amounts to keeping the posterior entropy
small, i.e.

$$p^* = \arg\min_{p_{i,j}} \max\big\{ H(\pi^{(t)}_{p_{i,j},1}),\; H(\pi^{(t)}_{p_{i,j},0}) \big\}, \tag{11}$$


where $H(\cdot)$ is the entropy and
$\pi^{(t)}_{p_{i,j},a_{i,j}}$ is the posterior distribution
over trees after knowing the answer for $p_{i,j}$ as
$a_{i,j}$. Note that this criterion chooses the question with
the highest information gain using the less informative
answer, which we found to be more robust than using the
expectation over answers. Since we cannot compute the entropy
exactly due to the size of the sample space, we reuse the
trees from the empirical distribution $\tilde{\pi}$ to
estimate information gain.
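
A minimal sketch of this selection rule on a weighted sample of trees is given below. It scores each path question by the larger of the two posterior entropies (the less informative answer) and returns the question with the smallest such score, which is the one with the highest worst-case information gain. Trees are parent arrays, $0 < \gamma < 1$ is assumed, questions about the root are skipped because they are true in every tree, and all names are illustrative.

```python
import math
from collections import defaultdict


def has_path(parent, i, j):
    """True if i is a proper ancestor of j."""
    node = parent[j]
    while node is not None:
        if node == i:
            return True
        node = parent[node]
    return False


def entropy_after(trees, weights, i, j, answer, gamma):
    """Entropy of the empirical distribution after reweighting by f(a | T)."""
    mass = defaultdict(float)
    for tree, w in zip(trees, weights):
        correct = (answer == 1) == has_path(tree, i, j)
        mass[tuple(tree)] += w * (1.0 - gamma if correct else gamma)
    total = sum(mass.values())
    return -sum(m / total * math.log(m / total) for m in mass.values() if m > 0)


def select_question(trees, weights, gamma):
    """Path question whose less informative answer leaves the lowest entropy."""
    n = len(trees[0])
    best, best_h = None, float("inf")
    for i in range(1, n):
        for j in range(1, n):
            if i == j:
                continue
            h = max(entropy_after(trees, weights, i, j, a, gamma) for a in (0, 1))
            if h < best_h:
                best, best_h = (i, j), h
    return best


# The three possible trees over two concepts, equally weighted.
trees = [[None, 0, 0], [None, 0, 1], [None, 2, 0]]
print(select_question(trees, [1 / 3] * 3, gamma=0.1))
```

In the toy example every candidate question separates one tree from the other two, so either of the two questions is an equally good first query.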

3.5 Adding New Nodes to the Hierarchy 

New concepts will sometimes be introduced to a domain. In
this case, how should the new concept be inserted into an
existing hierarchy? A wasteful way would be to re-build the
whole hierarchy distribution from scratch using the pipeline
described in the previous sections. Alternatively, one might
consider adding a row and column to the current weight matrix
and initializing all new entries with the same value.
Unfortunately, uniform weights do not result in uniform edge
probabilities and thus do not correctly represent the
uninformed prior over the new node’s location.

We instead propose a method to estimate the weight matrix $W$
that correctly reflects uncertainty over the current tree and
the location of the new node. After building a hierarchy for
$N+1$ nodes, the learned weight matrix is denoted as $W_N$.
We can sample a sequence of trees $(T_1, \ldots, T_m)$ from
the distribution $P(T \mid W_N)$. Now we want to insert a new
node into the hierarchy. Since there is no prior information
about the position of the new node, we generate a new set of
trees by inserting this node into any location in each tree
in $(T_1, \ldots, T_m)$. This new set represents a sample
distribution that preserves the information of the previous
distribution while not assuming anything about the new node.
The weight matrix $W_{N+1}$ that minimizes KL-divergence to
this sample can then be estimated using the method described
in Section 3.3.
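
The insertion step can be sketched as below for trees stored as parent arrays. The text does not specify how a location is drawn, so the scheme here is purely an assumption: the new node is attached under a uniformly chosen existing node and then takes over each of that node's children with probability one half, which lets it land either as a leaf or as an interior node. The function name and the 0.5 probability are illustrative.

```python
import random


def insert_node(parent, rng):
    """Return a copy of the tree with one extra node at a random location."""
    new = len(parent)
    anchor = rng.randrange(new)              # parent chosen for the new node
    out = list(parent) + [anchor]
    for child in range(1, new):
        # The new node may take over some of the anchor's children, which
        # places it between the anchor and those children.
        if parent[child] == anchor and rng.random() < 0.5:
            out[child] = new
    return out


rng = random.Random(1)
base = [None, 0, 1, 1, 0]                    # a sampled tree over 4 concepts
extended = [insert_node(base, rng) for _ in range(5)]
```

Whatever placement scheme is used, ancestor relations among the existing nodes are unchanged, which is what preserves the information of the previous distribution.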

















Figure 1: Weight estimation performance for hierarchies with 20
nodes. X-axis is the number of samples given to the algorithm, and
$\beta$ is the regularization coefficient.

4 Experiments 

In our experiments, we evaluate four aspects of our approach:
(1) the performance of approximate inference to estimate the
weights representing distributions over trees; (2) the
efficiency of active vs. random query strategies in selecting
questions; (3) comparison to existing work; and (4) the
ability to build hierarchies for diverse application domains.

4.1 Sample-based Weight Estimation 

To evaluate the ability of Algorithm 1 to estimate a weight
matrix based on samples generated from a distribution over
trees, we proceeded as follows. We first sample a “ground
truth” weight matrix $W$, then sample trees according to that
weight matrix, followed by estimating the weight matrix $W^*$
using Algorithm 1, and finally evaluate the quality of the
estimated matrix. To do so, we sample an additional test set
of trees from $P(T \mid W)$ and compute the log-likelihood of
these trees given $W^*$, where $P(T \mid W^*)$ is defined in
(1). For calibration purposes, we also compute the
log-likelihood of these trees under the ground truth
specified by $P(T \mid W)$.

Fig. 1 shows the performance for $N = 20$ nodes, using
different values for the regularization coefficient $\beta$
and sample size $m$. Each data point is an average over 100
runs on different weights sampled from a Dirichlet
distribution (to give an intuition for the complexity of the
investigated distributions, when sampling 1 million trees
according to one of the weights, we typically get about
900,000 distinct trees). The red line on top is the ground
truth log-likelihood. As can be seen, the algorithm always
converges to the ground truth as the number of samples
increases. With an appropriate setting of $\beta$, the
proposed method requires about 10,000 samples to achieve a
log-likelihood that is close to the ground truth. We also
tested different tree sizes $N$, and got quite similar
performance. Overall, we found that $\beta = 0.01$ works
robustly across different $N$ and use that value in all the
following experiments.

4.2 Active vs. Random Queries 

To evaluate the ability of our technique to recover the
correct hierarchy, we artificially generate trees and test
the performance for different tree sizes, different noise
rates for path queries, and active vs. random path queries.
To simulate a worker’s response to a query, we first check
whether the query path is part of the ground truth tree, and
then flip the answer randomly using a pre-set noise rate
$\gamma$. To evaluate how well our approach estimates the
ground truth tree, we use the marginal likelihood of the tree
edges and compute the Area Under the Curve (AUC) using
different thresholds on the likelihood to predict the
existence or absence of an edge in the ground truth tree. The
marginal likelihood of an edge,
$P(e_{i,j} \mid W) = \sum_{T \in \mathcal{T} : e_{i,j} \in T} P(T \mid W)$,
can be computed in closed form based on the Matrix-Tree
Theorem [Tutte, 1984]. We also tested different evaluation
measures, such as the similarity between the MAP tree and the
ground truth tree, and found them to all behave similarly.
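
The simulation and scoring steps described in this paragraph can be sketched as follows. `simulate_answer` flips the true answer with probability $\gamma$, and `auc` uses the pairwise-ranking form of the area under the ROC curve, which is equivalent to sweeping a threshold over the edge marginals. Both function names are illustrative, and the scores in the usage example are made-up placeholders rather than results from the paper.

```python
import random


def simulate_answer(truth, gamma, rng):
    """Return the true 0/1 answer, flipped with probability gamma."""
    return 1 - truth if rng.random() < gamma else truth


def auc(scores, labels):
    """Probability that a random true edge gets a higher score than a random
    false edge, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Placeholder edge marginals and ground-truth edge indicators.
marginals = [0.9, 0.7, 0.2, 0.4, 0.1]
in_truth = [1, 1, 0, 1, 0]
print(auc(marginals, in_truth))
```

An AUC of 1 in this form means every edge of the ground truth tree receives a higher marginal than every edge outside it.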

Different sized trees of {5, 10, 15} nodes are tested to see
how the method performs as the problem gets larger. We also
test different noise rates, including 0%, 5%, and 10%, to
verify the robustness of the method. The number of samples
for updating the weight matrix is fixed to 10,000 across all
experiments. For each setting, 10 different random ground
truth trees are generated. The average results are reported
in Fig. 2. The X-axis is the number of questions asked, and
the Y-axis is the AUC. AUC = 1 means that all edges of the
ground truth tree have higher probabilities than any other
edges according to the estimated distribution over trees.

As can be seen in the figures, active queries always recover
the hierarchy more efficiently than their random counterparts
(random queries are generated by randomly choosing a pair of
nodes). If there is no noise in the answers, our approach
always recovers the ground truth hierarchy, despite the
sample-based weight update. Note that, in the noise-free
setting, the exact hierarchy can be perfectly recovered by
querying all $N^2$ pairs of concepts. While the random
strategy typically requires about twice that number to
recover the hierarchy, our proposed active query strategy
always recovers the ground truth tree using fewer than $N^2$
queries.

As $N$ gets larger, the difference between active and random
queries becomes more significant. While our active strategy
always recovers the ground truth tree, the random query
strategy does not converge for trees of size 15 if the
answers are noisy. This is due to insufficient samples when
updating the weights, and because random queries are
frequently answered with No, which provides little
information. The active approach, on the other hand, uses the
current tree distribution to determine the most informative
query and generates more peaked distributions over trees,
which can be estimated more robustly with our sampling
technique. As an indication of this, for trees of size 15 and
noise-free answers, 141 out of the first 200 active queries
are answered with “yes”, while this is the case for only 54
random queries.


4.3 Comparison to Existing Work 

We compare our method with the most relevant systems, Deluge
[Bragg et al., 2013] and Cascade [Chilton et al., 2013],
which also use crowdsourcing to build hierarchies. Cascade
builds hierarchies based on multi-label categorization, and
Deluge improves the multi-label classification performance of
Cascade. We will thus compare to Deluge, using their
evaluation dataset [Bragg et al., 2013]. This dataset has 33
labels that are part of the fine-grained entity tags [Ling
and Weld, 2012]. The WordNet hierarchy is used as the ground
truth hierarchy.

Figure 2: Experimental results comparing active query (solid lines) and random query (dashed lines) strategies for tree sizes ranging from
(left) 5 nodes to (right) 15 nodes, using three different noise rates for answers to questions.

Deluge queries many items for each label from a knowledge
base, randomly selects a subset of 100 items, labels items
with multiple labels using crowdsourcing, and then builds a
hierarchy using the label co-occurrence. To classify items
into multi-labels, it asks workers to vote for questions,
which are binary judgements about whether an item belongs to
a category. Deluge does active query selection based on the
information gain, and considers the label correlation to
aggregate the votes and build a hierarchy. We use the code
and parameter settings provided by the authors of Deluge.

We compare the performance of our method to Deluge using
different amounts of votes. We compare the following
settings: 1) both methods use 1,600 votes; 2) Deluge uses
49,500 votes and our method uses 6,000 votes. For the first
setting, we pick 1,600 votes for both, as suggested by the
authors, because Deluge’s performance saturates after that
many votes. In the second setting, we compute the results of
using all the votes collected in the dataset to see the best
performance of Deluge. We choose 6,000 votes for our method
because its performance becomes flat after that.

We compare both methods using AUC as the evaluation
criterion. Using 1,600 votes, our method achieves a value of
0.82, which is slightly better than Deluge with an AUC of
0.79. However, Deluge does not improve significantly beyond
that point, reaching an AUC of 0.82 after 49,500 votes. Our
approach, on the other hand, keeps on improving its accuracy
and reaches an AUC of 0.97 after only 6,000 queries. This
indicates that, using our approach, non-expert workers can
achieve performance very close to that achieved by experts
(AUC = 1). Furthermore, in contrast to our method, Deluge
does not represent uncertainty over hierarchies and requires
items for each label.

4.4 Real World Applications 

In these experiments we provide examples demonstrating that
our method can be applied to different tasks using AMT. The
following pipeline is followed for all three application
domains: collect the set of concepts; design the “path
question” to ask; collect multiple answers for all possible
“path questions”; estimate hierarchies using our approach.
Collecting answers for all possible questions enabled us to
test different settings and methods without collecting new
data for each experiment. AMT is used to gather answers for
“path questions”. The process for different domains is almost
identical: we ask workers to answer “true-or-false” questions
regarding a path in a hierarchy. Our method is able to
consider the noise rate of workers. We estimate this by
gathering answers from 8 workers for each question, then take
the majority vote as the answer, and use all answers to
determine the noise ratio for that question. Note that noise
ratios not only capture the inherent noise in using AMT, but
also the uncertainty of people about the relationship between
concepts. Five different path questions are put into one
Human Intelligence Task (HIT). Each HIT costs $0.04. The
average time for a worker to finish one HIT is about 4
seconds.
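
One plausible reading of this aggregation step is sketched below: the majority vote becomes the answer, and the fraction of workers who disagree with it becomes the noise ratio for that question. The exact estimator is not spelled out in the text, so both the disagreement fraction and the tie-breaking rule are assumptions, and the function name is illustrative.

```python
def aggregate(answers):
    """Majority answer and the fraction of workers who disagree with it."""
    yes = sum(answers)
    majority = 1 if 2 * yes >= len(answers) else 0   # ties go to 1 (assumption)
    disagree = len(answers) - yes if majority == 1 else yes
    return majority, disagree / len(answers)


# Eight workers, one of whom answers "yes".
print(aggregate([1, 0, 0, 0, 0, 0, 0, 0]))
```

A question on which workers split evenly gets a noise ratio of 0.5, which makes its answer uninformative under the likelihood in Eq. (5).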

The process of building the hierarchies is divided into two
consecutive phases. In the first phase, a distribution is
built using a subset of the concepts. In the second phase, we
use the process of inserting new concepts into the hierarchy,
until all concepts are represented. For the body part
dataset, we randomly chose 10 concepts for the first phase.
For online Amazon shopping and RGBD object data, the initial
set is decided by thresholding the frequency of the words
used by workers to tag images (15 nodes for Amazon objects
and 23 nodes for RGBD objects). The learned MAP hierarchies
are shown in Fig. 3.

Representing Body Parts 

Here, we want to build a hierarchy to visualize the “is a part
of” relationship between body parts. The set of body part
words is collected using Google search. An example path
question would be “Is ear part of upper body?”. The MAP
tree after asking 2,000 questions is shown in the left panel of
Fig. 3. As can be seen, the overall structure agrees very well
with people’s common sense of the human body structure.
Some of the nodes in the tree are shown in red, indicating
edges whose marginal probability is below 0.75. These edges
also reflect people’s uncertainty in the concept hierarchy. For
example, it is not obvious whether ear should be part of the
head or face, the second most likely placement. Similarly, it
is not clear for people whether ankle should be part of foot or
leg, and whether wrist should be part of arm or hand.

Figure 3: MAP hierarchies representing body parts, Amazon kitchen products, and food items (left to right). Red nodes indicate items for
which the parent edge has high uncertainty (marginal probability below 0.75). Videos showing the whole process of hierarchy generation can
be found on our project page: http://rse-lab.cs.washington.edu/projects/learn-taxonomies.

An obvious mistake made by the system is that ring finger
and thumb are connected to the upper body rather than hand.
This is caused by questions such as “Is ring finger part of
arm?”, which only 1 out of 8 workers answered with yes.
Hence the concept of ring finger or thumb is not placed into
a position below arm.
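The red nodes are those whose parent edge has a marginal probability below 0.75. A minimal sketch of that flagging step is given below, assuming the posterior is represented by a set of sampled trees and that an edge marginal is estimated as the fraction of samples containing the edge. The function names and the toy samples are hypothetical and are not data from the experiments.

```python
from collections import Counter


def edge_marginals(sampled_trees):
    """Estimate P(parent -> child) as the fraction of sampled trees with that edge.

    Each tree is a dict mapping child -> parent.
    """
    counts = Counter()
    for tree in sampled_trees:
        counts.update((parent, child) for child, parent in tree.items())
    n = len(sampled_trees)
    return {edge: c / n for edge, c in counts.items()}


def uncertain_edges(map_tree, marginals, threshold=0.75):
    """Return the MAP-tree edges whose estimated marginal is below `threshold`."""
    return sorted(
        (parent, child)
        for child, parent in map_tree.items()
        if marginals.get((parent, child), 0.0) < threshold
    )


# Toy samples: "ear" sits under "head" in 3 of 5 samples, under "face" in 2.
samples = [{"ear": "head", "face": "head"}] * 3 + [{"ear": "face", "face": "head"}] * 2
marginals = edge_marginals(samples)
map_tree = {"ear": "head", "face": "head"}
flagged = uncertain_edges(map_tree, marginals)
```

In this toy case only the edge from head to ear falls below the threshold, mirroring the ear example discussed above.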


Online Shopping Catalogue 

The second task is to arrange kitchen products taken from
the Amazon website. There are some existing kitchen
hierarchies, for example, the Amazon hierarchy [Amazon,
2015]. However, the words used by Amazon, for example,
“Tools-and-Gadgets”, might be quite confusing for customers.
Therefore, we collected the set of words used by
workers in searching for products. We provide AMT workers
with images of products, and ask them to write down words
they would like to see in navigating kitchen products. Some
basic preprocessing is done to merge plural and singular forms of
the same word, remove obviously wrong words, and remove
tags used less than 5 times by workers because they might
be nicknames used by a particular person. We also remove
the word “set”, because it is used by workers to refer
to a “collection of things” (e.g., pots set, knives set), but is not
related to the type of products shown in the pictures. The path
questions have the form “Would you try to find pots under the
category of kitchenware?” The learned MAP tree is shown in
the middle panel of Fig. 3.
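The tag preprocessing described above can be sketched as follows. This is an illustrative approximation only: the plural merging uses a naive trailing-"s" rule chosen for brevity (it turns "knives" into "knive"), the function names are hypothetical, and the removal of "obviously wrong words" is a manual step that is not modeled.

```python
from collections import Counter


def normalize(tag):
    """Lowercase a tag and strip a trailing 's' as a crude plural merge."""
    tag = tag.strip().lower()
    if len(tag) > 3 and tag.endswith("s") and not tag.endswith("ss"):
        tag = tag[:-1]
    return tag


def clean_tags(raw_tags, min_count=5, stop_words=("set",)):
    """Merge plural/singular forms, drop stop words, and drop rare tags."""
    counts = Counter(normalize(t) for t in raw_tags)
    return {
        tag: n
        for tag, n in counts.items()
        if n >= min_count and tag not in stop_words
    }


# Toy input: "pots" and "pot" merge to 6 uses, "pan" is too rare, "set" is dropped.
raw = ["pots"] * 3 + ["pot"] * 3 + ["set"] * 9 + ["pan"] * 4
kept = clean_tags(raw)
```

A real pipeline would likely replace the trailing-"s" rule with a proper lemmatizer.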


Food Item Names 

This experiment investigates learning a hierarchy over food
items used in a robotics setting [Lai et al., 2011a], where the
goal is to learn names people use in a natural setting to refer
to objects. Here, AMT workers were shown images from the
RGBD object dataset [Lai et al., 2011a] and asked to provide
names they would use to refer to these objects. Some basic
pre-processing was done to remove noisy tags and highly
infrequent words. The path questions for this domain are of the
form “Is it correct to say all apples are fruits?”.

The MAP tree is shown in the right panel of Fig. 3. Again,
while the tree captures the correct hierarchy mostly, high
uncertainty items provide interesting insights. For instance,
tomato is classified as fruit in the MAP tree, but also has a
significant probability of being a vegetable, indicating people’s
uncertainty, or disagreement, about this concept. Meanwhile,
the crowd of workers was able to uncover very non-obvious
relationships such as allium is a kind of bulb.

5 Conclusion 

We introduced an approach for learning hierarchies over 
concepts using crowdsourcing. Our approach incorporates 
simple questions that can be answered by non-experts 
without global knowledge of the concept domain. To deal 
with the inherent noise in crowdsourced information and 
with people’s uncertainty, and possible disagreement, about 
hierarchical relationships, we develop a Bayesian framework 
for estimating posterior distributions over hierarchies. When 
new answers become available, these distributions are updated
efficiently using a sampling-based approximation for
the intractably large set of possible hierarchies. The Bayesian 
treatment also allows us to actively generate queries that 
are most informative given the current uncertainty. New 
concepts can be added to the hierarchy at any point in time, 
automatically triggering queries that enable the correct 
placement of these concepts. It should also be noted that our 
approach lends itself naturally to manual correction of errors 
in an estimated hierarchy: by setting the weights inconsistent 
with a manual annotation to zero, the posterior over trees 
automatically adjusts to respect this constraint. 
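The manual-correction mechanism can be sketched as follows, assuming the posterior over trees is proportional to the product of the weights of the edges in the tree. Under that assumption any tree using a zero-weight edge receives zero probability. The function name `pin_parent` and the 3-node weight matrix are hypothetical.

```python
def pin_parent(weights, child, parent):
    """Return a copy of `weights` that only allows `parent` as parent of `child`.

    `weights[i][j]` is the nonnegative weight of the edge i -> j. Entries that
    are inconsistent with the annotation (any other parent of `child`) are
    set to zero; the input matrix is left unchanged.
    """
    pinned = [row[:] for row in weights]
    for i in range(len(pinned)):
        if i != parent:
            pinned[i][child] = 0.0
    return pinned


# Toy example: node 0 is the root; the annotation says "the parent of 2 is 1".
W = [
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 2.0],
    [0.0, 0.3, 0.0],
]
W_fixed = pin_parent(W, child=2, parent=1)
```

Only the column of the annotated child changes, so the rest of the posterior is reweighted without any further bookkeeping.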

We investigated several aspects of our framework and
demonstrated that it is able to recover good hierarchies
for real world concepts using AMT. Importantly, by
reasoning about uncertainty over hierarchies, our approach
is able to unveil confusion of non-experts over concepts,
such as whether tomato is a fruit or vegetable, or whether
the wrist is part of a person’s arm or hand. We believe that
these abilities are extremely useful for applications where
hierarchies should reflect the knowledge or expectations of
regular users, rather than domain experts. Example applications
could be search engines for products or restaurants,
or robots interacting with people who use various terms to
relate to objects in the world. Investigating such use cases
is an interesting avenue for future research. Other possible
future directions include explicit treatment of synonyms and
automatic detection of inconsistencies.
6 Derivation of the Objective Function
(14) uses the definition of $P(T \mid \Lambda)$.
(15) uses the fact that
where $P(e_{i,j})$ is the marginal likelihood of the edge $e_{i,j}$.

To get (18) → (19), we use the inequality that, if $x_j \in \mathbb{R}$ and $p_j \ge 0$ with $\sum_j p_j \le 1$, then

$$\exp\Big(\sum_j p_j x_j\Big) - 1 \le \sum_j p_j \left(e^{x_j} - 1\right).$$
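The inequality is Jensen's inequality for the convex function $e^x$, applied after adding a slack weight $p_0 = 1 - \sum_j p_j$ at $x_0 = 0$. A quick numerical sanity check is sketched below; the helper name `jensen_gap` and the sampling ranges are arbitrary choices made for illustration.

```python
import math
import random


def jensen_gap(p, x):
    """Return RHS - LHS of exp(sum p_j x_j) - 1 <= sum p_j (exp(x_j) - 1)."""
    lhs = math.exp(sum(pj * xj for pj, xj in zip(p, x))) - 1.0
    rhs = sum(pj * (math.exp(xj) - 1.0) for pj, xj in zip(p, x))
    return rhs - lhs


rng = random.Random(0)
gaps = []
for _ in range(1000):
    n = rng.randint(1, 6)
    raw = [rng.random() + 1e-9 for _ in range(n)]
    scale = rng.random() / sum(raw)  # rescale so that sum(p) < 1
    p = [r * scale for r in raw]
    x = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    gaps.append(jensen_gap(p, x))

smallest_gap = min(gaps)  # should be nonnegative up to rounding error
```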
(11) → (12) is true because $\log(1 + x) \le x$, $\forall x > -1$, and
7 Minimization of (19)

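Case 1: $\lambda_{i,j} + \delta_{i,j} = 0$, i.e., $\delta_{i,j} = -\lambda_{i,j}$, so that the $(i,j)$ term becomes

$$-\beta|\lambda_{i,j}| + \lambda_{i,j}\tilde{P}(e_{i,j}) + \frac{1}{N} P(e_{i,j})\left(e^{-N\lambda_{i,j}} - 1\right).$$

Case 2: $\lambda_{i,j} + \delta_{i,j} > 0$, so that, up to terms that do not depend on $\delta_{i,j}$, the $(i,j)$ term becomes

$$-\delta_{i,j}\tilde{P}(e_{i,j}) + \frac{1}{N} P(e_{i,j})\left(e^{N\delta_{i,j}} - 1\right) + \beta\delta_{i,j}.$$

Take the derivative of this expression with respect to $\delta_{i,j}$ and set it to 0:

$$-\tilde{P}(e_{i,j}) + P(e_{i,j})e^{N\delta_{i,j}} + \beta = 0,$$

the solution is

$$\delta_{i,j} = \frac{1}{N}\log\frac{\tilde{P}(e_{i,j}) - \beta}{P(e_{i,j})}.$$

Case 3: $\lambda_{i,j} + \delta_{i,j} < 0$, so that, up to terms that do not depend on $\delta_{i,j}$, the $(i,j)$ term becomes

$$-\delta_{i,j}\tilde{P}(e_{i,j}) + \frac{1}{N} P(e_{i,j})\left(e^{N\delta_{i,j}} - 1\right) - \beta\delta_{i,j}.$$

Take the derivative of this expression with respect to $\delta_{i,j}$ and set it to 0:

$$-\tilde{P}(e_{i,j}) + P(e_{i,j})e^{N\delta_{i,j}} - \beta = 0,$$

the solution is

$$\delta_{i,j} = \frac{1}{N}\log\frac{\tilde{P}(e_{i,j}) + \beta}{P(e_{i,j})}.$$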
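The three cases can be combined into a single update rule. The sketch below assumes the per-edge term has the form $F(\delta) = -\delta\tilde{P} + \frac{1}{N}P(e^{N\delta} - 1) + \beta(|\lambda + \delta| - |\lambda|)$ with $P > 0$ and $\beta > 0$. This form is an assumption about the objective, and the function names `objective` and `best_delta` are hypothetical. Because $F$ is convex in $\delta$ with a single kink at $\delta = -\lambda$, at most one of the two stationary points is consistent with its own case, and otherwise the minimum sits at the kink.

```python
import math


def objective(delta, lam, p_emp, p_model, beta, n):
    """Per-edge term F(delta) in the assumed form described in the lead-in."""
    return (
        -delta * p_emp
        + p_model * (math.exp(n * delta) - 1.0) / n
        + beta * (abs(lam + delta) - abs(lam))
    )


def best_delta(lam, p_emp, p_model, beta, n):
    """Minimize the convex per-edge term by checking the three cases."""
    if p_emp - beta > 0:
        d = math.log((p_emp - beta) / p_model) / n  # Case 2 stationary point
        if lam + d > 0:
            return d
    d = math.log((p_emp + beta) / p_model) / n  # Case 3 stationary point
    if lam + d < 0:
        return d
    return -lam  # Case 1: the updated parameter lam + delta is exactly zero


# Example: the empirical marginal exceeds the model marginal, so delta > 0.
step = best_delta(lam=0.5, p_emp=0.6, p_model=0.3, beta=0.05, n=10)
```

The Case 1 branch is what drives parameters exactly to zero, the usual effect of an $\ell_1$ penalty.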
8 Proof of Theorem 1

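Theorem 3. Assume $\beta$ is strictly positive. Then Algorithm 1 produces a sequence $\Lambda^{(1)}, \Lambda^{(2)}, \ldots$ such that

$$\lim_{t \to \infty} L^{\beta}_{\tilde{\pi}}(\Lambda^{(t)}) = \min_{\Lambda} L^{\beta}_{\tilde{\pi}}(\Lambda).$$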

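Proof. First let us define $\Lambda^+$ and $\Lambda^-$ in terms of $\Lambda$ as follows: for each $(i,j)$, if $\lambda_{i,j} \ge 0$, then $\lambda^+_{i,j} = \lambda_{i,j}$ and $\lambda^-_{i,j} = 0$, and if $\lambda_{i,j} \le 0$,
then $\lambda^+_{i,j} = 0$ and $\lambda^-_{i,j} = -\lambda_{i,j}$. $\hat{\Lambda}^+$, $\hat{\Lambda}^-$, etc. are defined analogously.

Let $F_{i,j}$ denote the $(i,j)$ component of the objective. For any $\Lambda$ and $\Delta$, we have the following:

$$|\lambda + \delta| - |\lambda| = \min\{\delta^+ + \delta^- \mid \delta^+ \ge -\lambda^+,\ \delta^- \ge -\lambda^-,\ \delta^+ - \delta^- = \delta\}.$$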
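The identity can be verified by a change of variables. The short argument below is a reading aid; the symbols $\mu^+$ and $\mu^-$ are introduced here only for this purpose and do not appear elsewhere in the proof.

```latex
% Recall \lambda = \lambda^+ - \lambda^- with \lambda^+, \lambda^- \ge 0 and
% \lambda^+ \lambda^- = 0, so that |\lambda| = \lambda^+ + \lambda^-.
\begin{align*}
\mu^+ &:= \lambda^+ + \delta^+ \ge 0, \qquad
\mu^- := \lambda^- + \delta^- \ge 0, \qquad
\mu^+ - \mu^- = \lambda + \delta, \\
\delta^+ + \delta^- &= (\mu^+ + \mu^-) - (\lambda^+ + \lambda^-)
                     = (\mu^+ + \mu^-) - |\lambda|, \\
\min_{\substack{\mu^+,\,\mu^- \ge 0 \\ \mu^+ - \mu^- = \lambda + \delta}}
      (\mu^+ + \mu^-) &= |\lambda + \delta|.
\end{align*}
```

The last minimum is attained by setting whichever of $\mu^+$, $\mu^-$ is not needed to zero, which gives the stated value of $|\lambda + \delta| - |\lambda|$.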

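Plugging into the definition of $F_{i,j}$ gives:

$$F_{i,j}(\Lambda, \Delta) = -\delta_{i,j}\tilde{P}(e_{i,j}) + \frac{1}{N} P(e_{i,j})\left(e^{N\delta_{i,j}} - 1\right) + \beta\left(|\lambda_{i,j} + \delta_{i,j}| - |\lambda_{i,j}|\right)$$

$$= \min\{G_{i,j}(\Lambda, \Delta^+, \Delta^-) \mid \delta^+_{i,j} \ge -\lambda^+_{i,j},\ \delta^-_{i,j} \ge -\lambda^-_{i,j},\ \delta^+_{i,j} - \delta^-_{i,j} = \delta_{i,j}\},$$

where

$$G_{i,j}(\Lambda, \Delta^+, \Delta^-) = (\delta^-_{i,j} - \delta^+_{i,j})\tilde{P}(e_{i,j}) + \frac{1}{N} P(e_{i,j})\left(e^{N(\delta^+_{i,j} - \delta^-_{i,j})} - 1\right) + \beta(\delta^+_{i,j} + \delta^-_{i,j}).$$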


So, by (19),
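Note that $G_{i,j}(\Lambda, 0, 0) = 0$, so none of the terms in this sum can be positive. So the $\Lambda^{(t)}$'s have a convergent subsequence converging to
some $\hat{\Lambda}$ such that, for every $(i,j)$,

$$\min_{\delta_{i,j}}\{\min\{G_{i,j}(\hat{\Lambda}, \Delta^+, \Delta^-) \mid \delta^+_{i,j} \ge -\hat{\lambda}^+_{i,j},\ \delta^-_{i,j} \ge -\hat{\lambda}^-_{i,j},\ \delta^+_{i,j} - \delta^-_{i,j} = \delta_{i,j}\}\} = 0.$$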


It is easy to verify that minimizing $L^{\beta}_{\tilde{\pi}}(\Lambda)$ is the dual problem of the following convex program:
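We will show that $\hat{\Lambda}^+$ and $\hat{\Lambda}^-$ together with $P(T \mid \hat{\Lambda})$ satisfy the KKT conditions of the previous convex program, and thus form a solution
to the primal problem as well as to the dual, the minimization of $L^{\beta}_{\tilde{\pi}}$. For $P(T \mid \hat{\Lambda})$, these conditions work out to be the following for all $(i,j)$:

$$\hat{\lambda}^+_{i,j} \ge 0, \quad \tilde{P}(e_{i,j}) - P(e_{i,j}) \le \beta, \quad \hat{\lambda}^+_{i,j}\left(\tilde{P}(e_{i,j}) - P(e_{i,j}) - \beta\right) = 0 \tag{40}$$

$$\hat{\lambda}^-_{i,j} \ge 0, \quad P(e_{i,j}) - \tilde{P}(e_{i,j}) \le \beta, \quad \hat{\lambda}^-_{i,j}\left(P(e_{i,j}) - \tilde{P}(e_{i,j}) - \beta\right) = 0 \tag{41}$$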

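Since $G_{i,j}(\hat{\Lambda}, 0, 0) = 0$, by (35), if $\hat{\lambda}^+_{i,j} > 0$ then $G_{i,j}(\hat{\Lambda}, \Delta^+, 0)$ is nonnegative in a neighborhood of $\delta^+_{i,j} = 0$, and so has a local
minimum at this point. That is,

$$\left.\frac{\partial G_{i,j}(\hat{\Lambda}, \Delta^+, 0)}{\partial \delta^+_{i,j}}\right|_{\delta^+_{i,j} = 0} = -\tilde{P}(e_{i,j}) + P(e_{i,j}) + \beta = 0. \tag{42}$$

If $\hat{\lambda}^+_{i,j} = 0$, then (35) gives that $G_{i,j}(\hat{\Lambda}, \Delta^+, 0) \ge 0$ for $\delta^+_{i,j} \ge 0$. Thus $G_{i,j}(\hat{\Lambda}, \Delta^+, 0)$ cannot be decreasing at $\delta^+_{i,j} = 0$. Therefore, the
partial derivative above must be nonnegative. Altogether, these prove (40). (41) can be proved analogously.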

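As a whole, we proved that

$$\lim_{t \to \infty} L^{\beta}_{\tilde{\pi}}(\Lambda^{(t)}) = L^{\beta}_{\tilde{\pi}}(\hat{\Lambda}) = \min_{\Lambda} L^{\beta}_{\tilde{\pi}}(\Lambda).$$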
9 Proof of Theorem 2

Lemma 1. Suppose samples $\tilde{\pi}$ are obtained from any tree distribution $\pi$. Then
Lemma 2. Suppose samples $\tilde{\pi}$ are obtained from any tree distribution $\pi$. Assume that $|\tilde{P}(e_{i,j}) - P(e_{i,j})| \le \beta$ for all $(i,j)$. Let $\hat{\Lambda}$ minimize
the regularized log loss $L^{\beta}_{\tilde{\pi}}(\Lambda)$. Then for every $\Lambda$ it holds that

$$L_{\pi}(\hat{\Lambda}) \le L_{\pi}(\Lambda) + 2\sum_{i=0}^{N}\sum_{j=1}^{N} \beta|\lambda_{i,j}|.$$






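Proof.

$$L_{\pi}(\hat{\Lambda}) \le L_{\tilde{\pi}}(\hat{\Lambda}) + \sum_{i,j}\beta|\hat{\lambda}_{i,j}| = L^{\beta}_{\tilde{\pi}}(\hat{\Lambda}) \tag{45}$$

$$\le L^{\beta}_{\tilde{\pi}}(\Lambda) = L_{\tilde{\pi}}(\Lambda) + \sum_{i,j}\beta|\lambda_{i,j}| \tag{46}$$

$$\le L_{\pi}(\Lambda) + 2\sum_{i,j}\beta|\lambda_{i,j}|. \tag{47}$$

(45) to (46) is true because of the optimality of $\hat{\Lambda}$. (46) to (47) follows from Lemma 1.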

Theorem 4. Suppose $m$ samples $\tilde{\pi}$ are obtained from any tree distribution $\pi$. Let $\hat{\Lambda}$ minimize the regularized log loss $L^{\beta}_{\tilde{\pi}}(\Lambda)$ with $\beta = \sqrt{\log(N/\delta)/m}$. Then for every $\Lambda$ it holds with probability at least $1 - \delta$ that

$$L_{\pi}(\hat{\Lambda}) \le L_{\pi}(\Lambda) + 2\|\Lambda\|_1\sqrt{\log(N/\delta)/m}.$$

Proof. By Hoeffding’s inequality, for a fixed pair $(i,j)$, the probability that $\tilde{P}(e_{i,j}) - P(e_{i,j})$ exceeds $\beta$ is at most $e^{-2\beta^2 m}$. By
the union bound, the probability of this happening for any pair $(i,j)$ is at most $\delta$. Then the theorem follows from Lemma 2. □
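To give a feel for how the bound scales, the sketch below evaluates the regularization strength $\beta = \sqrt{\log(N/\delta)/m}$ and the additive term $2\|\Lambda\|_1\beta$ from the theorem. The function names and the numeric values in the example are hypothetical and are not taken from the experiments.

```python
import math


def regularization_strength(n_concepts, n_samples, failure_prob):
    """beta = sqrt(log(N / delta) / m), the setting used in the theorem."""
    if not 0.0 < failure_prob < 1.0:
        raise ValueError("failure_prob must lie strictly between 0 and 1")
    if n_concepts < 1 or n_samples < 1:
        raise ValueError("n_concepts and n_samples must be positive")
    return math.sqrt(math.log(n_concepts / failure_prob) / n_samples)


def excess_loss_bound(l1_norm, n_concepts, n_samples, failure_prob):
    """Additive gap 2 * ||Lambda||_1 * beta on the theorem's right-hand side."""
    return 2.0 * l1_norm * regularization_strength(n_concepts, n_samples, failure_prob)


# Quadrupling the number of samples halves the additive gap.
gap_1k = excess_loss_bound(l1_norm=3.0, n_concepts=20, n_samples=1000, failure_prob=0.05)
gap_4k = excess_loss_bound(l1_norm=3.0, n_concepts=20, n_samples=4000, failure_prob=0.05)
```

The gap decays as $1/\sqrt{m}$ and grows only logarithmically in $N/\delta$.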








